Patient Selection for Diabetes Drug Testing

Exploratory Data Analysis

Criteria | Meets Specification

The project uses visualizations to analyze which fields have a high amount of missing/null values or high cardinality.

The project correctly identified which field(s) has/have a high amount of missing/zero values.

The project correctly identified which field(s) has/have a Gaussian distribution shape based on the histogram analysis.

The project correctly identified fields with high cardinality.

The project justified why these fields had high cardinality.

The project correctly describes the distributions for age and gender.
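
One lightweight way to surface these properties is a quick pandas/matplotlib pass. The sketch below is illustrative and assumes the dataset has already been loaded into a DataFrame named df; the column selection and bin count are arbitrary choices, not rubric requirements.

```python
import matplotlib.pyplot as plt
import pandas as pd

# df is assumed to be the EHR dataset loaded as a pandas DataFrame.

# Fraction of missing/null values per field, worst first.
null_fraction = df.isnull().mean().sort_values(ascending=False)
print(null_fraction.head(10))

# Cardinality (unique value count) per field; very large counts flag
# candidate high-cardinality fields such as diagnosis or drug code sets.
cardinality = df.nunique().sort_values(ascending=False)
print(cardinality.head(10))

# Histograms of numeric fields help judge which distributions look Gaussian.
df.select_dtypes(include="number").hist(figsize=(12, 8), bins=30)
plt.tight_layout()
plt.show()
```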

Optional: The project uses TensorFlow Data Validation visualizations to analyze which fields have a high amount of missing/null values or high cardinality.
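
If you take the optional TensorFlow Data Validation route, a minimal sketch looks like the following (again assuming the data is in a pandas DataFrame df):

```python
import tensorflow_data_validation as tfdv

# Compute dataset statistics and render the interactive visualization, which
# reports missing-value percentages and unique-value counts for every field.
stats = tfdv.generate_statistics_from_dataframe(df)
tfdv.visualize_statistics(stats)
```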

The project selects features based on exploratory data analysis.

The project correctly identifies whether to include/exclude payer_code and weight fields.

The project justified why these fields should be included/excluded by using supporting data analysis.
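
For example, a quick check of how sparsely populated these two fields are can support the include/exclude decision. The sketch below assumes missing values may be encoded either as nulls or as the placeholder string "?"; adjust to however your dataset encodes them.

```python
# Share of missing values for the candidate fields; very high missingness is a
# common, data-backed justification for excluding a feature.
for field in ["payer_code", "weight"]:
    missing_rate = (df[field].isnull() | (df[field] == "?")).mean()
    print(f"{field}: {missing_rate:.1%} missing")
```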

Data Preparation

Criteria | Meets Specification

The project uses the correct level(s) for the given EHR dataset (line, encounter, patient) and transforms, aggregates and filters appropriately.

The project correctly reduces dimensionality for a dataset containing a high cardinality code set field.

The project correctly maps NDC codes to generic drug names and prints out the correct mappings in the notebook.
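
A minimal sketch of that mapping is shown below. It assumes a lookup table ndc_lookup.csv with columns NDC_Code and Non-proprietary Name and an ndc_code field in the encounter data; all of these names are illustrative rather than prescribed.

```python
import pandas as pd

def reduce_dimension_ndc(df, ndc_df):
    """Map raw NDC codes to generic drug names, collapsing a high-cardinality
    code field into a much smaller set of values."""
    mapping = dict(zip(ndc_df["NDC_Code"], ndc_df["Non-proprietary Name"]))
    out = df.copy()
    out["generic_drug_name"] = out["ndc_code"].map(mapping)
    return out

ndc_df = pd.read_csv("ndc_lookup.csv")   # hypothetical lookup file
reduced_df = reduce_dimension_ndc(df, ndc_df)
print(df["ndc_code"].nunique(), "->", reduced_df["generic_drug_name"].nunique())
```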

The project dataset has been split correctly for EHR machine learning models.

The project has correctly split the original dataset into train, validation, and test datasets.

The project's dataset splits do not contain patient or encounter data leakage.

The project's code passes the Encounter Test.
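
A common way to satisfy all three of these points is to split on unique patient identifiers rather than on rows, so every encounter for a given patient lands in exactly one partition. The sketch below assumes a patient_nbr identifier column and a 60/20/20 split; both are illustrative.

```python
import numpy as np

def patient_level_split(df, patient_key="patient_nbr", train=0.6, val=0.2, seed=42):
    """Split train/validation/test by patient so no patient spans partitions."""
    rng = np.random.default_rng(seed)
    patients = df[patient_key].unique()
    rng.shuffle(patients)
    n_train = int(train * len(patients))
    n_val = int(val * len(patients))
    train_ids = set(patients[:n_train])
    val_ids = set(patients[n_train:n_train + n_val])
    train_df = df[df[patient_key].isin(train_ids)]
    val_df = df[df[patient_key].isin(val_ids)]
    test_df = df[~df[patient_key].isin(train_ids | val_ids)]
    return train_df, val_df, test_df

train_df, val_df, test_df = patient_level_split(df)

# Leakage check: no patient appears in more than one partition.
assert set(train_df["patient_nbr"]).isdisjoint(set(val_df["patient_nbr"]))
assert set(train_df["patient_nbr"]).isdisjoint(set(test_df["patient_nbr"]))
assert set(val_df["patient_nbr"]).isdisjoint(set(test_df["patient_nbr"]))
```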

Feature Engineering

Criteria | Meets Specification

The project correctly creates categorical features using TensorFlow's feature column API for transforming the data.

The project correctly completes the categorical feature transformer boilerplate function.

The project successfully uses this function to transform the demo dataset with at least one new categorical feature.
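
A minimal sketch of the categorical transformer is shown below. The field name, vocabulary source, and use of one-hot indicator columns are illustrative choices (reusing names from the earlier sketches), not the only acceptable ones.

```python
import tensorflow as tf

def create_tf_categorical_feature_cols(categorical_col_list, vocab_dict):
    """Build one-hot (indicator) feature columns from in-memory vocabularies.
    vocab_dict maps each field name to its list of known values."""
    output_cols = []
    for col in categorical_col_list:
        cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
            key=col, vocabulary_list=vocab_dict[col], num_oov_buckets=1)
        output_cols.append(tf.feature_column.indicator_column(cat_col))
    return output_cols

# Example: one new categorical feature built from the generic drug name field.
tf_cat_cols = create_tf_categorical_feature_cols(
    ["generic_drug_name"],
    {"generic_drug_name": train_df["generic_drug_name"].dropna().unique().tolist()})
```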

The project correctly creates numerical features using TensorFlow's feature column API for transforming the data.

The project correctly completes the numerical feature transformer boilerplate function.

The project successfully uses this function to transform the demo dataset with at least one new numerical feature.

The project's transformer function correctly incorporates the provided z-score normalizer function (or another custom normalizer) for normalization.
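
A minimal sketch of the numerical transformer, assuming a z-score normalizer closed over training-set statistics (the field name is illustrative):

```python
import functools
import tensorflow as tf

def normalize_numeric_with_zscore(col, mean, std):
    """Z-score normalization: (x - mean) / std."""
    return (col - mean) / std

def create_tf_numeric_feature(col, mean, std, default_value=0):
    """Wrap a numeric field in a feature column that normalizes values on the fly."""
    normalizer = functools.partial(normalize_numeric_with_zscore, mean=mean, std=std)
    return tf.feature_column.numeric_column(
        key=col, default_value=default_value, normalizer_fn=normalizer, dtype=tf.float64)

# Example: one new numerical feature using training-set statistics only,
# so no information from validation/test leaks into the normalization.
mean, std = train_df["number_inpatient"].mean(), train_df["number_inpatient"].std()
tf_numeric_col = create_tf_numeric_feature("number_inpatient", mean, std)
```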

Model Building and Analysis

Criteria | Meets Specification

The project correctly utilizes the regression model predictions for uncertainty estimation and classification output analysis.

The project has prepared the regression model predictions for TF Probability and binary classification outputs by doing the following:

  • Correctly utilized TF Probability to provide mean and standard deviation prediction outputs
  • Created an output prediction dataset that has the labels correctly mapped to a binary prediction and actual value
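
One way to do this is sketched below. It assumes the trained Keras regressor ends in a TensorFlow Probability DistributionLambda layer (so calling the model returns a distribution), that test_ds yields (features, label) batches, and that 5 hospital days is an illustrative eligibility threshold.

```python
import pandas as pd

# Assumptions: `model` is the trained probabilistic regressor whose output layer
# is a tfp.layers.DistributionLambda, and `test_ds` is a batched tf.data.Dataset.
features, labels = next(iter(test_ds))
yhat = model(features)                 # a TFP distribution under the assumption above

pred_df = pd.DataFrame({
    "pred_mean": yhat.mean().numpy().flatten(),   # mean predicted hospital days
    "pred_std": yhat.stddev().numpy().flatten(),  # per-patient uncertainty estimate
    "actual_days": labels.numpy().flatten(),
})

# Map the regression prediction and the actual value to binary labels with the
# illustrative 5-day threshold for trial eligibility.
pred_df["score"] = (pred_df["pred_mean"] >= 5).astype(int)
pred_df["label_value"] = (pred_df["actual_days"] >= 5).astype(int)
```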

The project correctly evaluates the model predictions across key classification metrics.

The model has been evaluated across the following classification metrics: AUC, F1, precision, and recall.
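
Given a prediction DataFrame like the pred_df sketched above (binary score and label_value columns plus the continuous pred_mean), these metrics can be computed with scikit-learn:

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

y_true = pred_df["label_value"]
y_pred = pred_df["score"]

print("AUC:      ", roc_auc_score(y_true, pred_df["pred_mean"]))  # uses raw scores
print("F1:       ", f1_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
```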

Students have answered both questions for the model summary and addressed the bias-variance tradeoff with regard to this problem.

The project utilizes the Aequitas toolkit to correctly create a bias report for race and gender on a provided dataset with some justification for their analysis.

The project contains a bias report with the following:

  • A visualization of at least two key metrics for patient selection
  • A visualization showing at least one reference group fairness example and its comparison on at least one metric (e.g. TPR).
  • Justification for analysis made about at least one visualization
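
A minimal Aequitas sketch is shown below. It assumes a DataFrame ae_df with the columns Aequitas expects (score, label_value, and the race and gender attribute columns) and uses Caucasian/Male as illustrative reference groups; your own choice of reference groups should be justified in the write-up.

```python
from aequitas.group import Group
from aequitas.bias import Bias
from aequitas.plotting import Plot

# Crosstab of group-level metrics (TPR, FPR, precision, etc.) by race and gender.
g = Group()
xtab, _ = g.get_crosstabs(ae_df)

# Disparity metrics relative to the chosen reference groups.
b = Bias()
bdf = b.get_disparity_predefined_groups(
    xtab, original_df=ae_df,
    ref_groups_dict={"race": "Caucasian", "gender": "Male"},
    alpha=0.05)

# Visualizations: a key metric across groups, and a disparity comparison
# against the reference group for at least one metric (e.g. TPR).
p = Plot()
p.plot_group_metric(xtab, "tpr")
p.plot_disparity(bdf, group_metric="tpr_disparity", attribute_name="race")
```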

Tips to make your project stand out:

  1. Students can tune the TF Probability library to show and rank uncertainty estimates.
  2. Students can use cross-feature columns.
  3. Students can use ICD groupings to reduce dimensionality.
  4. Students can build a more complex representation of the data - e.g. conditional probability matrix, pre-trained embeddings, etc.
  5. Students can use TensorFlow Data Validation visualizations to analyze which fields have a high amount of missing/null values or high cardinality.
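
As an example of tip 2, a crossed feature column lets the model learn interaction effects between two categorical fields; the field names and bucket size below are illustrative.

```python
import tensorflow as tf

# Cross two categorical fields (here: age bucket and gender) and one-hot encode
# the result so a downstream DenseFeatures layer can consume it.
age_x_gender = tf.feature_column.crossed_column(["age", "gender"], hash_bucket_size=100)
crossed_col = tf.feature_column.indicator_column(age_x_gender)
```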